This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # Data manipulation
library(obisdi) # Tools for data ingestion for OBIS
library(here) # Get paths (important!)
library(arrow) # To deal with parquet files
The checklist will be downloaded from FigShare. We use the
obisdi function to do the download and also to obtain
metadata. Because the files are large, we added a line to control and
only download the data once and save the resulting metadata:
# Get the path to data/raw
raw_path <- here("data", "raw")
# See if files were already downloaded
lf <- list.files(raw_path)
if (!any(grepl("figshare", lf))) {
fig_details <- get_figshare(article_id = 21997559, download_files = T,
save_meta = T, path = raw_path)
}
Following the download the details of the dataset can be accessed from the file data/raw/figshare_metadata_09062023.csv.
Title: Global cold-water coral diversity dataset
Authors: Eliza Fragkopoulou, Viktoria Balogh, Ester Serrão, Jorge
Assis
Date (dmy format): 22/02/2023
DOI: 10.6084/m9.figshare.21997559.v2
URL: https://figshare.com/articles/dataset/A_comprehensive_dataset_of_the_cold-water_coral_diversity/21997559
First we reduce the size of the raw files by converting them to the
parquet format. We keep only the flagged file which is the
one that we will include in the OBIS database.
raw_files <- list.files(raw_path, full.names = T)
file.remove(raw_files[-grep("Flagged|metadata", raw_files)])
# We just run the conversion in the first knitting of this document
if (any(grepl("xlsx", raw_files))) {
flagged <- readxl::read_xlsx(paste0(raw_path, "/databaseFlaggedFinal.xlsx"))
write_parquet(flagged, paste0(raw_path, "/databaseFlagged.parquet"))
rm(flagged)
file.remove(paste0(raw_path, "/databaseFlaggedFinal.xlsx"))
}
Now we can load the parquet file containing the dataset we will work with.
dataset <- read_parquet(paste0(raw_path, "/databaseFlagged.parquet"))
head(dataset)
We will filter the dataset to remove those records that are already available on OBIS. This dataset contains records that werre directly derived from OBIS, but also some from other OBIS derived sources (like MedOBIS). In that case, we will filter by “Ocean Biogeographic Information System” (old name) and “Ocean Biodiversity Information System”.
dataset_filt <- dataset %>%
mutate(proc_bibliographicCitation = tolower(bibliographicCitation)) %>%
filter(!grepl("ocean biogeographic information system|ocean biodiversity information system", proc_bibliographicCitation)) %>%
select(-proc_bibliographicCitation)
This is a dataset that was already standardized to the DwC standard
and it’s almost ready to use. However, the MeasurementOrFact column will
need to be translated to the right format. We start by creating a new
object with the occurrenceID and the MoF column, and then
by separating each MoF into a new column.
flags <- dataset_filt %>%
select(occurrenceID, MeasurementOrFact)
flags <- flags %>%
separate_wider_delim(cols = 2, delim = ",",
names = c("flagLand", "flagVerticalRange", "flagGeographicRange"))
flags_conv <- flags %>%
mutate(flagLand = str_remove(str_remove(flagLand, '\\{\\".*\\"\\:\\['), "\\]"),
flagVerticalRange = str_remove(str_remove(flagVerticalRange, '\\".*\\"\\:\\['), "\\]"),
flagGeographicRange = str_remove(str_remove(flagGeographicRange, '\\".*\\"\\:\\['), "\\].")) %>%
mutate(across(2:4, function(x){
x <- gsub('\\"', "", x)
as.numeric(x)
}))
We can check that the conversion worked by looking into the unique values of each one:
table(flags$flagLand)
##
## {"flagLand":["-1"] {"flagLand":["1"]
## 93772 458379
table(flags_conv$flagLand)
##
## -1 1
## 93772 458379
table(flags$flagVerticalRange)
##
## "flagVerticalRange":["-1"] "flagVerticalRange":["1"]
## 152923 399228
table(flags_conv$flagVerticalRange)
##
## -1 1
## 152923 399228
table(flags$flagGeographicRange)
##
## "flagGeographicRange":["-1"]} "flagGeographicRange":["1"]}
## 82697 469454
table(flags_conv$flagGeographicRange)
##
## -1 1
## 82697 469454
As we said, this dataset is already on the DwC standard, so no mapping will be necessary. However, the object we created with the flags is not on the STAR format. We adjust that with the following code:
flags_conv <- flags_conv %>%
pivot_longer(cols = 2:4,
names_to = "measurementType",
values_to = "measurementValue")
That’s all we needed to do with the data.
As a final step, we just remove the MeasurementOrFact
column of the other object, as this will be supplied to the IPT in a
different file.
dataset_filt <- dataset_filt %>%
select(-MeasurementOrFact)
And those are the final objects:
dataset_filt
flags_conv
We can then save the final files:
processed_path <- here("data", "processed")
write_csv(flags_conv, paste0(processed_path, "/extension.csv"))
# Because the occurrence table is large and GitHub have size limitations
# we split the file into multiple files
div <- ceiling(nrow(dataset_filt)/10)
splits <- seq(1, nrow(dataset_filt), by = div)
splits[length(splits)] <- nrow(dataset_filt)
splits <- splits[-1]
st <- 1
for (i in splits) {
write_csv(dataset_filt[st:i,],
paste0(processed_path, "/occurrence_part", which(splits == i),".csv"))
st <- i+1
}
# We can do a quick check that everything went ok:
lf <- list.files(processed_path, "occurrence_part", full.names = T)
parts <- do.call("rbind", lapply(lf, read.csv))
nrow(dataset_filt) == nrow(parts)
## [1] TRUE
all.equal(dataset_filt$occurrenceID,
parts$occurrenceID)
## [1] TRUE
And we check if the files are saved:
list.files(processed_path)
## [1] "extension.csv" "occurrence_part1.csv" "occurrence_part2.csv"
## [4] "occurrence_part3.csv" "occurrence_part4.csv" "occurrence_part5.csv"
## [7] "occurrence_part6.csv" "occurrence_part7.csv" "occurrence_part8.csv"
## [10] "occurrence_part9.csv"
